Dyr og Data

Statistical thinking — exploratory data analysis

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-09-19

Barcharts

Bar charts are a simple way of showing & comparing means or other values between groups of data

Top of the bar is drawn at the data value you wish to plot, bottom at 0

Dotplots

However, barcharts are wasteful & often don’t focus on the real information you want to convey

Dotplots show the same information but focus on the difference in the variable plotted

Incorporating variability

Common to show a measure of spread or variability in dotplots or barcharts

Standard errors of means or confidence intervals usually more useful than standard deviations

  • Standard deviation \(\hat{\sigma}\)
  • Standard error of the mean \(\mathrm{se}_{\overline{y}} = \hat{\sigma} / \sqrt{n}\)
  • 1 - \(\alpha\) confidence interval for \(\overline{y}\)
    • \(\alpha\) = 0.05 (95% CI)
    • Large samples; \(\overline{y} \pm 1.96 \; \mathrm{se}_{\overline{y}}\)
    • Small samples; \(\overline{y} \pm t_{n-1,\,1-\alpha/2} \; \mathrm{se}_{\overline{y}}\)
  • dynamite plots
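
The measures of spread listed above can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `summarise_mean` and the simulated data are illustrative assumptions, and the interval uses the normal quantile, so it is only appropriate for large samples (a small sample would need a quantile from the \(t\) distribution instead).

```python
import math
import random
import statistics

def summarise_mean(y, alpha=0.05):
    """Return the mean, SD, SE and a large-sample (1 - alpha) CI for the mean."""
    n = len(y)
    mean = statistics.fmean(y)
    sd = statistics.stdev(y)          # sample SD (n - 1 denominator)
    se = sd / math.sqrt(n)            # standard error of the mean
    # Normal quantile, roughly 1.96 when alpha = 0.05 (large-sample assumption)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return {"n": n, "mean": mean, "sd": sd, "se": se,
            "ci": (mean - z * se, mean + z * se)}

# Simulated data, purely for illustration
rng = random.Random(2024)
y = [rng.gauss(50, 10) for _ in range(200)]
summary = summarise_mean(y)
print(summary)
```

The SE shrinks with \(\sqrt{n}\) while the SD does not, which is why error bars based on the SE or a confidence interval say more about the precision of the mean.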

Boxplots

A boxplot is a useful summary of samples with \(n > 8\) observations

Shows the median, the upper and lower hinges & whiskers, plus any observations that lie beyond the whiskers

  • Median is robust to outliers
  • Hinges are ~ lower & upper quartiles
  • Whiskers extend to the most extreme observation within \(1.5 \times \mathrm{IQR}\) of the hinges
  • Points beyond whiskers are shown individually; they are not necessarily outliers
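
The components of a boxplot can be sketched numerically. The example below assumes the common Tukey convention, in which whiskers reach the most extreme observation within 1.5 × IQR of the box, and it approximates the hinges with quartiles from `statistics.quantiles`. The function name `boxplot_stats` and the data are made up for illustration.

```python
import statistics

def boxplot_stats(x, coef=1.5):
    """Median, approximate hinges, whisker ends and points beyond the whiskers."""
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - coef * iqr
    upper_fence = q3 + coef * iqr
    inside = [v for v in x if lower_fence <= v <= upper_fence]
    beyond = [v for v in x if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "hinges": (q1, q3),
        # Whiskers stop at the most extreme data values inside the fences
        "whiskers": (min(inside), max(inside)),
        "beyond": beyond,
    }

data = [2, 3, 3, 4, 4, 5, 5, 6, 7, 20]   # illustrative values only
stats = boxplot_stats(data)
print(stats)
```

Here the value 20 lies beyond the upper whisker and would be drawn as an individual point.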

Boxplots

Useful to compare observations from 2 or more groups

  • Width of box is proportional to \(n\)
  • Notches around median, at \(\pm 1.58 \times \mathrm{IQR} / \sqrt{n}\), are ~ 95% confidence intervals for difference of medians
  • If notches overlap, fail to reject null hypothesis that observations drawn from populations with same median
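
A rough sketch of the notch comparison follows. It assumes the widely used notch definition of median ± 1.58 × IQR / √n; the helper names `notch` and `notches_overlap` and the two groups are hypothetical.

```python
import math
import statistics

def notch(x):
    """Approximate notch limits: median +/- 1.58 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(x))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True if the notch intervals of the two groups overlap."""
    lo_a, hi_a = notch(a)
    lo_b, hi_b = notch(b)
    return lo_a <= hi_b and lo_b <= hi_a

group_a = list(range(1, 21))              # illustrative data
group_b = [v + 15 for v in group_a]       # same spread, shifted well upwards
group_c = [v + 1 for v in group_a]        # same spread, shifted slightly

print(notches_overlap(group_a, group_b))
print(notches_overlap(group_a, group_c))
```

Non-overlapping notches, as for `group_a` and `group_b`, suggest the medians differ; overlapping notches, as for `group_a` and `group_c`, do not.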

Histograms

Histograms; a graphical representation of the frequency or density distribution of data

  • Data are assigned to predetermined bins
  • Values of a discrete variable usually determine the bins
  • Number of bins is determined by the bin width
  • Area under a histogram sums to 1 for histogram showing densities
  • Often, histograms drawn as frequency histograms; area no longer sums to 1
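
The difference between a frequency and a density histogram can be checked numerically. This is a minimal sketch assuming equal-width bins that start at the smallest observation; the function name `histogram` and the data are illustrative.

```python
def histogram(x, width):
    """Return (counts, densities) for equal-width bins starting at min(x)."""
    lowest = min(x)
    n_bins = int((max(x) - lowest) // width) + 1
    counts = [0] * n_bins
    for v in x:
        counts[int((v - lowest) // width)] += 1
    n = len(x)
    # Density = count / (n * bin width), so that bar areas sum to 1
    densities = [c / (n * width) for c in counts]
    return counts, densities

x = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.0, 4.6, 5.9]   # illustrative values
width = 1.0
counts, densities = histogram(x, width)

print(counts, sum(counts))                       # frequencies sum to n
print(sum(d * width for d in densities))         # density areas sum to 1
```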

Histograms

Choosing the bin width is a critical step in drawing a histogram

  • Square root: number of bins \(k = \sqrt{n}\)
  • Sturges’: bin width \(h = \frac{\mathrm{range}(x)}{\log_2 n + 1}\)
  • Scott’s: bin width \(h = \frac{3.49\hat{\sigma}}{n^{1/3}}\)
  • Freedman-Diaconis: bin width \(h = 2 \frac{\mathrm{IQR}(x)}{n^{1/3}}\)

\(\hat{\sigma}\) is the estimated SD, \(n\) number of observations, & \(\mathrm{IQR}\) the interquartile range
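
The rules above are easy to compare on a single sample. In this sketch the square-root rule is treated as giving a number of bins, while the other three expressions are treated as bin widths; the function name `bin_rules`, the dictionary keys and the simulated data are assumptions made for illustration.

```python
import math
import random
import statistics

def bin_rules(x):
    """Evaluate four common histogram binning rules for the sample x."""
    n = len(x)
    data_range = max(x) - min(x)
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "sqrt_bins": math.ceil(math.sqrt(n)),                 # number of bins
        "sturges_width": data_range / (math.log2(n) + 1),     # bin width
        "scott_width": 3.49 * sd / n ** (1 / 3),              # bin width
        "fd_width": 2 * iqr / n ** (1 / 3),                   # bin width
    }

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(256)]   # simulated data
rules = bin_rules(x)
print(rules)
```

A width can be converted to a number of bins by dividing the range of the data by the width and rounding up.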

Quantile-Quantile plots

Quantile-quantile (QQ) plots are useful to determine if a sample is normally distributed

Plots quantiles of the data against those of a reference distribution. If the data are normally distributed, points should fall on the line through the upper & lower quartiles of both distributions

100 random draws from a \(t_3\) distribution — heavy tails compared to normal
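
The coordinates of a normal QQ plot can be sketched as follows. The plotting positions \((i - 0.5)/n\) are one common choice and are an assumption here, as are the helper names `normal_qq` and `draw_t`; the \(t_3\) draws are simulated as a standard normal divided by the square root of a scaled chi-squared variable.

```python
import math
import random
import statistics

def normal_qq(sample):
    """Pair theoretical standard-normal quantiles with the sorted sample."""
    n = len(sample)
    std_normal = statistics.NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

def draw_t(df, rng):
    """One draw from a t distribution with df degrees of freedom."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

rng = random.Random(1)
draws = [draw_t(3, rng) for _ in range(100)]
points = normal_qq(draws)

# The most extreme points are where heavy tails depart from the reference line
print(points[0], points[-1])
```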

Scatterplots

Thus far dealt only with univariate data displays. Other displays needed for bivariate and multivariate data

A scatterplot displays the relationship between two variables, \(x\) and \(y\) say

Each point on the plot represents the value of \(x_i\) and \(y_i\) for a single observation \(i\)

Simpson’s paradox

Important to plot data so you aren’t surprised when you model it
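
A small numerical illustration of Simpson's paradox, using made-up data: within each group \(y\) increases with \(x\), yet the pooled least-squares slope is negative. The helper `slope` and the group values are hypothetical.

```python
import statistics

def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Made-up data: two groups, each with a positive trend
x_a, y_a = [1, 2, 3], [10, 11, 12]
x_b, y_b = [6, 7, 8], [3, 4, 5]

slope_a = slope(x_a, y_a)
slope_b = slope(x_b, y_b)
slope_pooled = slope(x_a + x_b, y_a + y_b)

print(slope_a, slope_b, slope_pooled)
```

A scatterplot with points coloured by group would reveal the two clusters immediately, whereas a model fitted to the pooled data alone would report a negative relationship.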

Violin plots

Violin plots can be thought of as a combination of a boxplot and a density plot

Raincloud plots

Boxplots and violin plots can be criticised because they don’t show the data

Raincloud plots are an alternative that does — the dots are binned like a histogram

Dotplots

Dotplots are a related graph that shows the data

Beeswarm plots

Beeswarm plots show all the data in a compact display and avoid points overlapping

Transformations

Linear least squares regression makes some strong assumptions about your data; often these don’t hold

All hope abandon ye who enter here?

Often a transformation of the data can make them follow the assumptions more closely — though there are better ways

Transformations also play an important role in EDA

Transformations

Powers & roots are a useful set of transformations: \(x \rightarrow x^p\)

If \(p\) is positive we have a power transformation; if \(p\) is negative we have an inverse power transformation

If \(p\) is a fraction we have a root transformation

  • p = 2: power transformation
  • p = -1: power (inverse) transformation
  • p = 0.5: square root transformation
  • p = 1/3: cube root transformation
  • p = 0: convention dictates this is the log transformation
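
The ladder of powers can be written as a single function. This sketch applies the plain \(x^p\) form from the slide, with \(p = 0\) mapped to the natural log by convention; the function name `ladder` is an assumption, and the inputs are assumed to be strictly positive.

```python
import math

def ladder(x, p):
    """Apply the power transformation x -> x**p, with p = 0 meaning log(x).

    Assumes all values in x are strictly positive.
    """
    if p == 0:
        return [math.log(v) for v in x]
    return [v ** p for v in x]

values = [1.0, 4.0, 9.0]
print(ladder(values, 2))      # power
print(ladder(values, 0.5))    # square root
print(ladder(values, -1))     # inverse
print(ladder(values, 0))      # log
```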

Transforming skewness

Highly skewed distributions are difficult to explore because most of the data is scrunched up at one end of the distribution

Descending the ladder of powers & roots towards \(\mathsf{\log(x)}\) pulls in the right tail. Ascending the ladder does the opposite; it pulls in the left tail if the data are negatively skewed
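
The effect of descending the ladder can be seen by comparing the skewness of a right-skewed sample before and after a log transformation. The sketch below uses simulated log-normal data and a simple moment-based skewness; the helper name `skewness` is an assumption.

```python
import math
import random
import statistics

def skewness(x):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(x)
    mean = statistics.fmean(x)
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
raw = [rng.lognormvariate(0, 1) for _ in range(2000)]   # simulated, right-skewed
logged = [math.log(v) for v in raw]

print(skewness(raw))      # strongly positive
print(skewness(logged))   # close to zero
```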

Transforming nonlinearity

Transformation can help render many types of nonlinear relationships roughly linear

Clear that we are thinking about pairs of variables here; \(x\) and \(y\)

Need to choose whether to transform \(y\), \(x\), or both?
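
As a hedged illustration of the choice, consider a synthetic power-law relationship \(y = 2x^3\): transforming both variables with logs makes the relationship exactly linear, which shows up as a correlation of 1 on the log scale. The helper `correlation` and the data are made up for this sketch.

```python
import math
import statistics

def correlation(x, y):
    """Pearson correlation coefficient between x and y."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: y is a power function of x
x = [float(i) for i in range(1, 21)]
y = [2 * v ** 3 for v in x]

r_raw = correlation(x, y)
r_loglog = correlation([math.log(v) for v in x], [math.log(v) for v in y])

print(r_raw, r_loglog)
```

On the log-log scale the slope of the line is the exponent of the power law, here 3, since \(\log y = \log 2 + 3 \log x\).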

Transforming nonlinearity

New York air quality data & various transformations to linearise the bivariate relationship